{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 对活动数据进行分析\n",
    "（只取训练集和测试集中出现的样本）\n",
    "\n",
    "数据来源于Kaggle竞赛：Event Recommendation Engine Challenge，根据\n",
    "    events they’ve responded to in the past\n",
    "    user demographic information\n",
    "    what events they’ve seen and clicked on in our app\n",
    "预测用户对某个活动是否感兴趣\n",
    "\n",
    "竞赛官网：\n",
    "https://www.kaggle.com/c/event-recommendation-engine-challenge/data\n",
    "\n",
    "\n",
    "活动描述信息在events.csv文件：共110维特征\n",
    "前9列：event_id, user_id, start_time, city, state, zip, country, lat, and lng.\n",
    "event_id：id of the event, \n",
    "user_id：id of the user who created the event.  \n",
    "city, state, zip, and country： more details about the location of the venue (if known).\n",
    "lat and lng： floats（latitude and longitude coordinates of the venue）\n",
    "start_time： 字符串，ISO-8601 UTC time，表示活动开始时间\n",
    "\n",
    "后101列为词频：count_1, count_2, ..., count_100，count_other\n",
    "count_N：活动描述出现第N个词的次数\n",
    "count_other：除了最常用的100个词之外的其余词出现的次数\n",
    "\n",
    "这里我们用count_1, count_2, ..., count_100，count_other属性做聚类，即活动用这些关键词来描述，可表示活动的类别"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 导入工具包"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "#数据量太大，pdandas不能一次讲所有数据读入\n",
    "#也可以用pandas,一次读取部分数据，可以参考：https://www.cnblogs.com/datablog/p/6127000.html\n",
    "#import pandas as pd\n",
    "\n",
    "import numpy as np\n",
    "import scipy.sparse as ss\n",
    "import scipy.io as sio\n",
    "\n",
    "#保存数据\n",
    "import pickle as cPickle\n",
    "\n",
    "#event的特征需要编码\n",
    "from utils import FeatureEng\n",
    "from sklearn.preprocessing import normalize\n",
    "#相似度/距离\n",
    "import scipy.spatial.distance as ssd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 统计活动数目"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "number of records :3137972\n"
     ]
    }
   ],
   "source": [
    "#读取数据，并统计有多少不同的events\n",
    "#其实EDA.ipynb中用read_csv已经统计过了\n",
    "lines = 0\n",
    "fin = open(\"events.csv\", 'rb')\n",
    "#找到用C/C++的感觉了\n",
    "#字段：event_id, user_id,start_time, city, state, zip, country, lat, and lng， 101 columns of words count\n",
    "fin.readline() # skip header，列名行\n",
    "for line in fin:\n",
    "    cols = line.decode().strip().split(\",\")\n",
    "    lines += 1\n",
    "fin.close()\n",
    "\n",
    "print(\"number of records :%d\" % lines)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "活动数目太多（300w+），训练+测试集的活动没这么多，所有先去处理train和test，得到竞赛需要用到的活动和用户\n",
    "然后对在训练集和测试集中出现过的活动和用户建立新的ID索引\n",
    "先运行user_event.ipynb,\n",
    "得到活动列表文件：PE_eventIndex.pkl"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 读取之前算好的测试集和训练集中出现过的活动\n",
    "详见user_event.ipynb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "number of events in train & test :13418\n"
     ]
    }
   ],
   "source": [
    "#读取训练集和测试集中出现过的活动列表\n",
    "eventIndex = cPickle.load(open(\"PE_eventIndex.pkl\", 'rb'))\n",
    "n_events = len(eventIndex)\n",
    "\n",
    "print(\"number of events in train & test :%d\" % n_events)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 处理events.csv --> 特征编码、活动之间的相似度"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "FE = FeatureEng()\n",
    "\n",
    "fin = open(\"events.csv\", 'rb')\n",
    "\n",
    "#字段：event_id, user_id,start_time, city, state, zip, country, lat, and lng， 101 columns of words count\n",
    "fin.readline() # skip header\n",
    "\n",
    "#start_time, city, state, zip, country, lat, and lng\n",
    "eventPropMatrix = ss.dok_matrix((n_events, 7))\n",
    "\n",
    "#词频特征\n",
    "eventContMatrix = ss.dok_matrix((n_events, 101))\n",
    "\n",
    "for line in fin.readlines():\n",
    "    cols = line.decode().strip().split(\",\")\n",
    "    eventId = str(cols[0])\n",
    "    \n",
    "    #if eventIndex.has_key(eventId):  #在训练集或测试集中出现\n",
    "    if eventId in eventIndex:  #在训练集或测试集中出现\n",
    "        i = eventIndex[eventId]\n",
    "        #event的特征编码，这里cols[4]只是简单处理，其实开始时间，地点等信息很重要\n",
    "        eventPropMatrix[i, 0] = FE.getJoinedYearMonth(cols[2]) # start_time\n",
    "        eventPropMatrix[i, 1] = FE.getFeatureHash(cols[3].encode(\"utf8\")) # city\n",
    "        eventPropMatrix[i, 2] = FE.getFeatureHash(cols[4].encode(\"utf8\")) # state\n",
    "        eventPropMatrix[i, 3] = FE.getFeatureHash(cols[5].encode(\"utf8\")) # zip\n",
    "        eventPropMatrix[i, 4] = FE.getFeatureHash(cols[6].encode(\"utf8\")) # country\n",
    "        eventPropMatrix[i, 5] = FE.getFloatValue(cols[7]) # lat\n",
    "        eventPropMatrix[i, 6] = FE.getFloatValue(cols[8]) # lon\n",
    "        \n",
    "        #词频\n",
    "        for j in range(9, 110):\n",
    "            eventContMatrix[i, j-9] = cols[j]\n",
    "fin.close()\n",
    "\n",
    "#用L2模归一化\n",
    "eventPropMatrix = normalize(eventPropMatrix,\n",
    "    norm=\"l2\", axis=0, copy=False)\n",
    "sio.mmwrite(\"EV_eventPropMatrix\", eventPropMatrix)\n",
    "\n",
    "#词频，可以考虑我们用这部分特征进行聚类，得到活动的genre\n",
    "eventContMatrix = normalize(eventContMatrix,\n",
    "    norm=\"l2\", axis=0, copy=False)\n",
    "sio.mmwrite(\"EV_eventContMatrix\", eventContMatrix)\n",
    "\n",
    "\n",
    "# calculate similarity between event pairs based on the two matrices\n",
    "eventPropSim = ss.dok_matrix((n_events, n_events))\n",
    "eventContSim = ss.dok_matrix((n_events, n_events))\n",
    "\n",
    "#读取在测试集和训练集中出现的活动对\n",
    "uniqueEventPairs = cPickle.load(open(\"PE_uniqueEventPairs.pkl\", 'rb'))\n",
    "\n",
    "for e1, e2 in uniqueEventPairs:\n",
    "    #i = eventIndex[e1]\n",
    "    #j = eventIndex[e2]\n",
    "    i = e1\n",
    "    j = e2\n",
    "    \n",
    "    #非词频特征，采用Person相关系数作为相似度\n",
    "    if not (i,j) in eventPropSim:\n",
    "        epsim = ssd.correlation(eventPropMatrix.getrow(i).todense(),\n",
    "            eventPropMatrix.getrow(j).todense())\n",
    "        \n",
    "        eventPropSim[i, j] = epsim\n",
    "        eventPropSim[j, i] = epsim\n",
    "    \n",
    "    #对词频特征，采用余弦相似度，也可以用直方图交/Jacard相似度\n",
    "    if not (i,j) in eventContSim:\n",
    "        ecsim = ssd.cosine(eventContMatrix.getrow(i).todense(),\n",
    "            eventContMatrix.getrow(j).todense())\n",
    "    \n",
    "        eventContSim[i, j] = epsim\n",
    "        eventContSim[j, i] = epsim\n",
    "    \n",
    "sio.mmwrite(\"EV_eventPropSim\", eventPropSim)\n",
    "sio.mmwrite(\"EV_eventContSim\", eventContSim)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "matrix([[0., 0., 0., ..., 0., 0., 0.]])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "eventPropSim.getrow(0).todense()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
